Abstract

In this paper we introduce the class of stationary prediction strategies 
and construct a prediction algorithm that asymptotically performs as well 
as the best continuous stationary strategy. We make mild compactness 
assumptions but no stochastic assumptions about the environment. In 
particular, no assumption of stationarity is made about the environment, 
and the stationarity of the considered strategies only means that they do 
not depend explicitly on time; we argue that it is natural to consider only 
stationary strategies even for highly non-stationary environments. 



1 Introduction 

This paper belongs to the area of learning theory that has been variously referred
to as prediction with expert advice, competitive on-line prediction, prediction of
individual sequences, and universal on-line learning; see [7] for a review. There
are many proof techniques known in this field; this paper is based on Kalnishkan
and Vyugin's Weak Aggregating Algorithm [16], but it is possible that some of
the numerous other techniques could be used instead.

In Section 2 we give the main definitions and state our main results,
Theorems 1–4; their proofs are given in Sections 3–6. In Section 7 we informally
discuss the notion of stationarity, and Section 8 concludes.



2 Main results 

The game of prediction between Predictor and Reality is played according to 
the following protocol (of perfect information, in the sense that either player 
can see the other player's moves made so far). 



Prediction protocol

Reality announces $(\ldots, x_{-1}, y_{-1}, x_0, y_0) \in (X \times Y)^\infty$.
FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in X$.
  Predictor announces $\gamma_n \in \Gamma$.
  Reality announces $y_n \in Y$.
END FOR.

After Reality's first move the game proceeds in rounds numbered by the positive
integers $n$. At the beginning of each round $n = 1, 2, \ldots$ Predictor is given some
signal $x_n$ relevant to predicting the following observation $y_n$. The signal is
taken from the signal space $X$ and the observations from the observation space
$Y$. Predictor then announces his prediction $\gamma_n$, taken from the prediction space
$\Gamma$, and the prediction's quality in light of the actual observation is measured by
a loss function $\lambda : \Gamma \times Y \to \mathbb{R}$. At the beginning of the game Reality chooses
the infinite past, $(x_n, y_n)$ for all $n \le 0$.
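
The protocol can be sketched as a simple loop. The following is only an illustration under stated assumptions: the infinite past is truncated to a finite list, signals and observations are real numbers, and the names `play` and `last_observation` are hypothetical (they do not come from the paper). The example strategy is stationary in the paper's sense because it depends on the history and the current signal but not on the round number.

```python
from typing import Callable, List, Tuple

# Finite stand-in for the infinite past (an assumption of this sketch).
History = List[Tuple[float, float]]


def play(past: History,
         signals: List[float],
         observations: List[float],
         strategy: Callable[[History, float], float],
         loss: Callable[[float, float], float]) -> float:
    """Run the prediction protocol and return Predictor's cumulative loss."""
    history = list(past)                  # Reality's first move: the past
    total = 0.0
    for x_n, y_n in zip(signals, observations):
        gamma_n = strategy(history, x_n)  # Predictor sees x_n, then predicts
        total += loss(gamma_n, y_n)       # Reality then reveals y_n
        history.append((x_n, y_n))
    return total


def last_observation(history: History, x: float) -> float:
    """A stationary strategy: it never looks at the round number."""
    return history[-1][1]


def square_loss(gamma: float, y: float) -> float:
    return (gamma - y) ** 2
```

For instance, `play([(0.0, 0.0)], [1.0, 2.0, 3.0], [1.0, 1.0, 2.0], last_observation, square_loss)` plays three rounds against a one-element past.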

In the games of prediction traditionally considered in machine learning
there is no infinite past. This situation is modeled in our framework by
extending the signal space and observation space by new elements $? \in X$ and
$? \in Y$, defining $\lambda(\gamma, ?)$ arbitrarily, and making Reality announce the infinite
past $(\ldots, x_{-1}, y_{-1}, x_0, y_0) = (\ldots, ?, ?, ?, ?)$ and refrain from announcing $x_n = ?$
or $y_n = ?$ afterwards (intuitively, $?$ corresponds to "no feedback from Reality").

We will always assume that the signal space $X$, the prediction space $\Gamma$,
and the observation space $Y$ are non-empty topological spaces and that the
loss function $\lambda$ is continuous. Moreover, we are mainly interested in the case
where $X$, $\Gamma$, and $Y$ are locally compact metric spaces, the prime examples being
Euclidean spaces and their open and closed subsets. Our first results will be
stated for the case where all three spaces $X$, $\Gamma$, and $Y$ are compact.

Remark Our results can be easily extended to the case where the loss on
the $n$th round is allowed to depend, in addition to $\gamma_n$ and $y_n$, on the past
$\ldots, x_{n-1}, y_{n-1}, x_n$. This would, however, complicate the notation.

Predictor's strategies in the prediction protocol will be called prediction
strategies (or prediction algorithms, when they are defined explicitly and we
want to emphasize this). Mathematically such a strategy is a function
$D : (X \times Y)^\infty \times X \times \{1, 2, \ldots\} \to \Gamma$; it maps each history
$(\ldots, x_{n-1}, y_{n-1}, x_n)$ and the current time $n$ to the chosen prediction. In this
paper we will only be interested in continuous prediction strategies $D$ (according
to the traditional point of view [22], going back to Brouwer, only continuous
prediction strategies can be computable; although it should be mentioned that
nowadays there are influential definitions of computability [SJO] not requiring
continuity). An especially natural class of strategies is formed by the stationary
prediction strategies $D : (X \times Y)^\infty \times X \to \Gamma$, which do not depend on time
explicitly; since the origin of time is usually chosen arbitrarily, this appears
to be a reasonable restriction (see Section 7 for a further discussion).

Universal prediction strategies: compact deterministic case 

In this and the next subsection we will assume that the spaces $X$, $\Gamma$, $Y$ are all
compact. A prediction strategy is CS universal for a loss function $\lambda$ if its
predictions $\gamma_n$ satisfy

$$\limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda\bigl(D(\ldots, x_{n-1}, y_{n-1}, x_n), y_n\bigr) \right) \le 0 \tag{1}$$

for any continuous stationary prediction strategy $D$ and any biinfinite
$\ldots, x_{-1}, y_{-1}, x_0, y_0, x_1, y_1, \ldots$. ("CS" refers to the continuity and stationarity
of the prediction strategies we are competing with.)

Theorem 1 Suppose $X$ and $Y$ are compact metric spaces, $\Gamma$ is a compact convex
subset of a Banach space, and the loss function $\lambda(\gamma, y)$ is continuous in
$(\gamma, y)$ and convex in the variable $\gamma \in \Gamma$. There exists a CS universal prediction
algorithm.

A CS universal prediction algorithm will be constructed in the next section. 

Universal prediction strategies: compact randomized case 

When the loss function $\lambda(\gamma, y)$ is not convex in $\gamma$, two difficulties appear:

• the conclusion of Theorem 1 becomes false if the convexity requirement is
removed ([16], Theorem 2);

• in some cases the notion of a continuous prediction strategy becomes
vacuous: e.g., there are no non-constant continuous stationary prediction
strategies when $\Gamma = \{0, 1\}$ and $(X \times Y)^\infty \times X$ is connected (the latter
condition is equivalent to $X$ and $Y$ being connected; see [11], Theorem
6.1.15).

To overcome these difficulties, we consider randomized prediction strategies.
The proof of Theorem 2 will give a universal, in a natural sense, randomized
prediction algorithm; on the other hand, there will be a vast supply of continuous
stationary prediction strategies.

Remark In fact, the second difficulty is more apparent than real: for example,
in the binary case ($Y = \{0, 1\}$) there are many non-trivial continuous prediction
strategies in the canonical form of the prediction game [30] with the prediction
space redefined as the boundary of the set of superpredictions [16].

A randomized prediction strategy is a function
$D : (X \times Y)^\infty \times X \times \{1, 2, \ldots\} \to \mathcal{P}(\Gamma)$ mapping the past complemented by
the current time to the probability measures on the prediction space; $\mathcal{P}(\Gamma)$ is
always equipped with the topology of weak convergence ([3]; this topology is also
discussed, in the compact case, in Section 4 below). In other words, this is a
prediction strategy in the extended game of prediction with the prediction space
$\mathcal{P}(\Gamma)$. Analogously, a stationary randomized prediction strategy is a function
$D : (X \times Y)^\infty \times X \to \mathcal{P}(\Gamma)$.

Let us say that a randomized prediction strategy outputting $\gamma_n$ is CS universal
for a loss function $\lambda$ if, for any continuous stationary randomized prediction
strategy $D$ and any biinfinite $\ldots, x_{-1}, y_{-1}, x_0, y_0, x_1, y_1, \ldots$,

$$\limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(d_n, y_n) \right) \le 0 \quad \text{a.s.}, \tag{2}$$

where $g_1, g_2, \ldots, d_1, d_2, \ldots$ are independent random variables distributed as

$$g_n \sim \gamma_n, \tag{3}$$

$$d_n \sim D(\ldots, x_{n-1}, y_{n-1}, x_n), \tag{4}$$

$n = 1, 2, \ldots$. Intuitively, the "a.s." in (2) refers to the prediction strategies'
internal randomization.

Theorem 2 Let $X$, $\Gamma$, and $Y$ be compact metric spaces and $\lambda$ be a continuous
loss function. There exists a CS universal randomized prediction algorithm.



Simple reductions to the compact case 

In the following two subsections we will discuss the case where the signal,
prediction, and observation spaces are not required to be compact. The goal of this
subsection is to show that the compact case is not as special as it may seem, as
far as Theorem 2 is concerned. The rest of the paper does not depend on this
subsection.

In general, we might consider $X$, $\Gamma$, and $Y$ together with their fixed
compactifications $\bar X$, $\bar\Gamma$, and $\bar Y$ (without loss of generality we can and will assume
that $X$, $\Gamma$, and $Y$ are dense in their compactifications, and then the
compactifications will be the closures of the original spaces, which explains our
notation). Let us suppose that $\lambda$ is bounded and continuous, and, moreover, can be
continuously extended to the product $\bar\Gamma \times \bar Y$ of the compactifications; such an
extension is then unique and will also be denoted $\lambda$.

If $X$, $\Gamma$, and $Y$ are Euclidean spaces their natural compactifications might be
chosen as Aleksandrov's one-point compactification ([11], Theorem 3.5.11), the
corresponding projective space (with $\mathbb{RP}^L$ being the compactification of $\mathbb{R}^L$), or
the corresponding closed unit ball (with the interior of the closed unit ball in $\mathbb{R}^L$
identified with $\mathbb{R}^L$ by mapping a vector $v$ of length $l \in [0, 1)$ in the former set to
the vector $(\tan(\pi l/2))v$). The Stone–Čech compactification ([11], Section 3.6)
will usually be too large: we will want our compactifications to be metrizable.

Theorem 2 will remain true if instead of assuming $X$, $\Gamma$, and $Y$ to be metric
compacts we assume that $\bar X$, $\bar\Gamma$, and $\bar Y$ are metric compacts and if in the
definition of CS universality (2) we only consider continuous stationary prediction
strategies that have a continuous extension to $(\bar X \times \bar Y)^\infty \times \bar X$.

Remark An elegant way to avoid considering compactifications would be to
assume that $X$, $\Gamma$, and $Y$ are metrizable proximity spaces (see [11], Section 8.4,
or [23], where [11]'s "proximity spaces" are called "separated proximity spaces")
and to consider only proximity prediction strategies. By Smirnov's theorem
([11], Theorem 8.4.13 and also Theorem 8.4.9; [23], Theorem 7.7) a proximity
space can be identified with the corresponding topological space equipped with
a compactification. Assuming that the loss function $\lambda$ is a bounded proximity
function, it can be uniquely continuously extended to the compactification $\bar\Gamma \times \bar Y$
([23], Theorem 7.10), and every proximity stationary prediction strategy can be
identified with a continuous function on the compactification $(\bar X \times \bar Y)^\infty \times \bar X$ (by
the same theorem). To ensure that the compactifications are metrizable, it is
sufficient to assume that the proximity spaces are second-countable (i.e., have
countable proximity weights; see [23], Theorem 8.14, and [11], Theorem 4.2.8).
We chose the slightly clumsier language of compactifications because the notion
of a topological space is much more familiar than that of a proximity space.



Universal prediction strategies: deterministic case 

Let us say that a set in a topological space is precompact if its closure is compact.
In Euclidean spaces, precompactness means boundedness. In this and the next
subsection we drop the assumption of compactness of $X$, $\Gamma$, and $Y$, and so we
have to redefine the notion of CS universality.

A prediction strategy outputting $\gamma_n \in \Gamma$ is CS universal for a loss function
$\lambda$ if, for any continuous stationary prediction strategy $D$ and for any
biinfinite $\ldots, x_{-1}, y_{-1}, x_0, y_0, x_1, y_1, \ldots$,

$$\bigl(\{\ldots, x_{-1}, x_0, x_1, \ldots\} \text{ and } \{\ldots, y_{-1}, y_0, y_1, \ldots\} \text{ are precompact}\bigr)$$

$$\Longrightarrow \limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda\bigl(D(\ldots, x_{n-1}, y_{n-1}, x_n), y_n\bigr) \right) \le 0. \tag{5}$$

The intuition behind the antecedent of (5), in the Euclidean case, is that the
prediction algorithm knows that $\|x_n\|$ and $\|y_n\|$ are bounded but does not know
an upper bound in advance.

Let us say that the loss function $\lambda$ is large at infinity if, for all $y^* \in Y$,

$$\lim_{\substack{y \to y^* \\ \gamma \to \infty}} \lambda(\gamma, y) = \infty$$

(in the sense that for each constant $M$ there exists a neighborhood $O_{y^*} \ni y^*$
and compact $C \subseteq \Gamma$ such that $\lambda(\Gamma \setminus C, O_{y^*}) \subseteq (M, \infty)$). Intuitively, we
require that faraway $\gamma \in \Gamma$ should be poor predictions for nearby $y^* \in Y$. This
assumption is satisfied for most of the usual loss functions used in competitive
on-line prediction.

Theorem 3 Suppose $X$ and $Y$ are locally compact metric spaces, $\Gamma$ is a convex
subset of a Banach space, and the loss function $\lambda(\gamma, y)$ is continuous, large at
infinity, and convex in the variable $\gamma \in \Gamma$. There exists a CS universal prediction
algorithm.

To have a specific example in mind, the reader might check that $X = \mathbb{R}^K$,
$\Gamma = Y = \mathbb{R}^L$, and $\lambda(\gamma, y) := \|y - \gamma\|$ satisfy the conditions of the theorem.

Universal prediction strategies: randomized case 

We say that a randomized prediction strategy outputting randomized
predictions $\gamma_n$ is CS universal if, for any continuous stationary randomized prediction
strategy $D$ and for any biinfinite $\ldots, x_{-1}, y_{-1}, x_0, y_0, x_1, y_1, \ldots$,

$$\bigl(\{\ldots, x_{-1}, x_0, x_1, \ldots\} \text{ and } \{\ldots, y_{-1}, y_0, y_1, \ldots\} \text{ are precompact}\bigr)$$

$$\Longrightarrow \limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(d_n, y_n) \right) \le 0 \quad \text{a.s.}, \tag{6}$$

where $g_1, g_2, \ldots, d_1, d_2, \ldots$ are independent random variables distributed
according to (3)–(4).

Theorem 4 Let $X$ and $Y$ be locally compact metric spaces, $\Gamma$ be a metric space,
and $\lambda$ be a continuous and large at infinity loss function. There exists a CS
universal randomized prediction algorithm.



3 Proof of Theorem 1

In the rest of the paper we will be using the notation $\Sigma$ for $(X \times Y)^\infty \times X$.
By Tikhonov's theorem ([11], Theorem 3.2.4) this is a compact space; it is also
metrizable ([11], Theorem 4.2.2). Another standard piece of notation throughout
the rest of the paper will be $\sigma_n := (\ldots, x_{n-1}, y_{n-1}, x_n) \in \Sigma$. Remember
that $\lambda$, as a continuous function on a compact set, is bounded below and above
([11], Theorem 3.10.6).

Let $\Gamma^\Sigma$ be the set of all continuous functions from $\Sigma$ to $\Gamma$ with the topology
of uniform convergence, generated by the metric

$$\rho(D_1, D_2) := \sup_{\sigma \in \Sigma} \rho\bigl(D_1(\sigma), D_2(\sigma)\bigr),$$

$\rho$ being the metric in $\Gamma$ (induced by the norm in the containing Banach space).
Since the topological space $\Gamma^\Sigma$ is separable ([11], Corollary 4.2.18 in combination
with Theorem 4.2.8), we can choose a dense sequence $D_1, D_2, \ldots$ in $\Gamma^\Sigma$.

Remark The topology in $\Gamma^\Sigma$ is defined via a metric, and this is one of the very
few places in this paper where we need a specific metric (for brevity we often
talk about "metric spaces", but this can always be replaced by "metrizable
topological spaces"). Without using the metric, we could say that the topology
in $\Gamma^\Sigma$ is the compact-open topology ([11], Section 3.4). Since $\Sigma$ is compact,
the compact-open topology on $\Gamma^\Sigma$ coincides with the topology of uniform
convergence ([11], Theorem 4.2.17). The separability of $\Gamma^\Sigma$ now follows from [11],
Theorem 3.4.16 in combination with Theorem 4.2.8.

The next step is to apply Kalnishkan and Vyugin's [16] Weak Aggregating
Algorithm (WAA) to this sequence. We cannot just refer to [16] and will have to
redo their derivation of the WAA's main property since Kalnishkan and Vyugin
only consider the case of finitely many "experts" $D_k$ and finite $Y$. (Although in
other respects we will not need their algorithm in full generality and so slightly
simplify it.)

Let $q_1, q_2, \ldots$ be a sequence of positive numbers summing to 1, $\sum_{k=1}^\infty q_k = 1$.
Define

$$l_n^{(k)} := \lambda(D_k(\sigma_n), y_n), \qquad L_N^{(k)} := \sum_{n=1}^N l_n^{(k)}$$

to be the instantaneous loss of the $k$th expert $D_k$ on the $n$th round and his
cumulative loss over the first $N$ rounds. For all $n, k = 1, 2, \ldots$ define

$$w_n^{(k)} := q_k \beta_n^{L_{n-1}^{(k)}}, \qquad \beta_n := \exp\left(-\frac{1}{\sqrt{n}}\right)$$

($w_n^{(k)}$ are the weights of the experts to use on round $n$) and

$$p_n^{(k)} := \frac{w_n^{(k)}}{\sum_{k=1}^\infty w_n^{(k)}}$$

(the normalized weights; it is obvious that the denominator is positive and
finite). The WAA's prediction on round $n$ is

$$\gamma_n := \sum_{k=1}^\infty p_n^{(k)} D_k(\sigma_n) \tag{7}$$

(the series is convergent in the Banach space since the compactness of $\Gamma$ implies
$\sup_{\gamma \in \Gamma} \|\gamma\| < \infty$, and $\gamma_n \in \Gamma$ since

$$\gamma_n - \sum_{k=1}^K \frac{p_n^{(k)}}{\sum_{k=1}^K p_n^{(k)}} D_k(\sigma_n) = \sum_{k=1}^K \left(1 - \frac{1}{\sum_{k=1}^K p_n^{(k)}}\right) p_n^{(k)} D_k(\sigma_n) + \sum_{k=K+1}^\infty p_n^{(k)} D_k(\sigma_n) \to 0 \tag{8}$$

as $K \to \infty$).

Let $l_n := \lambda(\gamma_n, y_n)$ be the WAA's loss on round $n$ and $L_N := \sum_{n=1}^N l_n$ be
its cumulative loss over the first $N$ rounds.
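
The weight update and the mixture (7) can be sketched in a few lines of code. This is a minimal illustration, not the paper's algorithm in full generality: it assumes finitely many experts (so the prior `prior` sums to 1 over a finite list), real-valued predictions, and a hypothetical function name `waa_predictions`.

```python
import math
from typing import Callable, List, Sequence, Tuple


def waa_predictions(expert_preds: Sequence[Sequence[float]],
                    outcomes: Sequence[float],
                    loss: Callable[[float, float], float],
                    prior: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Weak Aggregating Algorithm for finitely many experts (a sketch).

    expert_preds[n][k] is expert k's prediction on round n + 1.
    Returns the WAA's predictions and the experts' cumulative losses.
    """
    num_experts = len(prior)
    cum = [0.0] * num_experts          # cumulative expert losses so far
    preds: List[float] = []
    for n, (row, y) in enumerate(zip(expert_preds, outcomes), start=1):
        # w_k = q_k * beta_n ** cum_k with beta_n = exp(-1 / sqrt(n))
        w = [q * math.exp(-c / math.sqrt(n)) for q, c in zip(prior, cum)]
        total = sum(w)
        p = [wk / total for wk in w]   # normalized weights
        preds.append(sum(pk * g for pk, g in zip(p, row)))  # mixture (7)
        for k in range(num_experts):   # losses are revealed after predicting
            cum[k] += loss(row[k], y)
    return preds, cum
```

Because the learning rate $1/\sqrt{n}$ shrinks with time, the weights are recomputed from the cumulative losses on every round rather than updated multiplicatively.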

Lemma 1 ([16], Lemma 9) The WAA guarantees that, for all $N$,

$$L_N \le \sum_{n=1}^N \sum_{k=1}^\infty p_n^{(k)} l_n^{(k)} - \sum_{n=1}^N \log_{\beta_n} \sum_{k=1}^\infty p_n^{(k)} \beta_n^{l_n^{(k)}} + \log_{\beta_N} \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}}. \tag{9}$$

The first two terms on the right-hand side of (9) are sums over the first $N$ rounds
of different kinds of mean of the experts' losses (see, e.g., [15], Chapter III, for
a general definition of the mean); we will see later that they nearly cancel each
other out. If those two terms are ignored, the remaining part of (9) is identical
(except that $\beta$ now depends on $n$) to the main property of the "Aggregating
Algorithm" (see, e.g., Lemma 1). All infinite series in (9) are trivially
convergent.

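Proof of Lemma 1 The proof is by induction on $N$. Assuming (9), we obtain

$$L_{N+1} = L_N + l_{N+1} \le L_N + \sum_{k=1}^\infty p_{N+1}^{(k)} l_{N+1}^{(k)}$$

$$\le \sum_{n=1}^{N+1} \sum_{k=1}^\infty p_n^{(k)} l_n^{(k)} - \sum_{n=1}^N \log_{\beta_n} \sum_{k=1}^\infty p_n^{(k)} \beta_n^{l_n^{(k)}} + \log_{\beta_N} \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}}$$

(the first "$\le$" used the "countable convexity" $l_n \le \sum_{k=1}^\infty p_n^{(k)} l_n^{(k)}$, which follows
from (8) and

$$\lambda\left( \sum_{k=1}^K \frac{p_n^{(k)}}{\sum_{k=1}^K p_n^{(k)}} D_k(\sigma_n), y_n \right) \le \sum_{k=1}^K \frac{p_n^{(k)}}{\sum_{k=1}^K p_n^{(k)}} \lambda(D_k(\sigma_n), y_n)$$

if we let $K \to \infty$). Therefore, it remains to prove

$$\log_{\beta_N} \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}} \le -\log_{\beta_{N+1}} \sum_{k=1}^\infty p_{N+1}^{(k)} \beta_{N+1}^{l_{N+1}^{(k)}} + \log_{\beta_{N+1}} \sum_{k=1}^\infty q_k \beta_{N+1}^{L_{N+1}^{(k)}}.$$

By the definition of $p_n^{(k)}$ this can be rewritten as

$$\log_{\beta_N} \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}} \le -\log_{\beta_{N+1}} \frac{\sum_{k=1}^\infty q_k \beta_{N+1}^{L_N^{(k)} + l_{N+1}^{(k)}}}{\sum_{k=1}^\infty q_k \beta_{N+1}^{L_N^{(k)}}} + \log_{\beta_{N+1}} \sum_{k=1}^\infty q_k \beta_{N+1}^{L_{N+1}^{(k)}},$$

which after cancellation becomes

$$\log_{\beta_N} \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}} \le \log_{\beta_{N+1}} \sum_{k=1}^\infty q_k \beta_{N+1}^{L_N^{(k)}}. \tag{10}$$

The last inequality follows from the general result about comparison of different
means ([15], Theorem 85), but we can also check it directly (following 2H). Let
$\beta_{N+1} = \beta_N^a$, where $0 < a < 1$. Then (10) can be rewritten as

$$\left( \sum_{k=1}^\infty q_k \beta_N^{L_N^{(k)}} \right)^a \ge \sum_{k=1}^\infty q_k \beta_N^{a L_N^{(k)}},$$

and the last inequality follows from the concavity of the function $t \mapsto t^a$.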
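
Inequality (10) says that the "$\beta$-mean" of the cumulative losses does not decrease when $\beta_N$ is replaced by $\beta_{N+1}$. A quick numerical sanity check can be sketched as follows; the finite prior and the helper names `log_mean` and `check_ineq_10` are assumptions of this illustration only.

```python
import math
from typing import Sequence


def log_mean(beta: float, q: Sequence[float], cum: Sequence[float]) -> float:
    """log_beta of sum_k q_k * beta ** cum_k (one side of inequality (10))."""
    return math.log(sum(qk * beta ** ck for qk, ck in zip(q, cum))) / math.log(beta)


def check_ineq_10(n_rounds: int, q: Sequence[float], cum: Sequence[float]) -> bool:
    """Check (10) with beta_N = exp(-1/sqrt(N)) for N = n_rounds."""
    beta_n = math.exp(-1.0 / math.sqrt(n_rounds))
    beta_next = math.exp(-1.0 / math.sqrt(n_rounds + 1))
    # Small tolerance guards against floating-point rounding only.
    return log_mean(beta_n, q, cum) <= log_mean(beta_next, q, cum) + 1e-12
```

A call such as `check_ineq_10(10, [0.2, 0.3, 0.5], [0.0, 3.0, -2.0])` evaluates both sides for one choice of weights and cumulative losses.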
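Lemma 2 ([16], Lemma 5) Let $L$ be an upper bound on $|\lambda|$. The WAA
guarantees that, for all $N$ and $K$,

$$L_N \le L_N^{(K)} + \left( L^2 e^L + \ln\frac{1}{q_K} \right) \sqrt{N}. \tag{11}$$

(There is no term $e^L$ in [16] since it only considers non-negative loss functions.)

Proof From (9), we obtain:

$$L_N \le \sum_{n=1}^N \sum_{k=1}^\infty p_n^{(k)} l_n^{(k)} + \sum_{n=1}^N \sqrt{n} \ln \sum_{k=1}^\infty p_n^{(k)} \exp\left( -\frac{l_n^{(k)}}{\sqrt{n}} \right) - \sqrt{N} \ln \sum_{k=1}^\infty q_k \exp\left( -\frac{L_N^{(k)}}{\sqrt{N}} \right)$$

$$\le \sum_{n=1}^N \sum_{k=1}^\infty p_n^{(k)} l_n^{(k)} + \sum_{n=1}^N \sqrt{n} \left( \sum_{k=1}^\infty p_n^{(k)} \left( 1 - \frac{l_n^{(k)}}{\sqrt{n}} + \frac{\bigl(l_n^{(k)}\bigr)^2}{2n} e^{L/\sqrt{n}} \right) - 1 \right) - \sqrt{N} \ln\left( q_K \exp\left( -\frac{L_N^{(K)}}{\sqrt{N}} \right) \right)$$

$$= L_N^{(K)} + \frac{1}{2} \sum_{n=1}^N \frac{1}{\sqrt{n}} \sum_{k=1}^\infty p_n^{(k)} \bigl(l_n^{(k)}\bigr)^2 e^{L/\sqrt{n}} + \sqrt{N} \ln\frac{1}{q_K}$$

$$\le L_N^{(K)} + \frac{L^2 e^L}{2} \sum_{n=1}^N \frac{1}{\sqrt{n}} + \sqrt{N} \ln\frac{1}{q_K} \le L_N^{(K)} + \frac{L^2 e^L}{2} \int_0^N \frac{\mathrm{d}t}{\sqrt{t}} + \sqrt{N} \ln\frac{1}{q_K}$$

$$= L_N^{(K)} + L^2 e^L \sqrt{N} + \sqrt{N} \ln\frac{1}{q_K}$$

(in the second "$\le$" we used the inequalities $e^t \le 1 + t + \frac{t^2}{2} e^{|t|}$ and $\ln t \le t - 1$). ∎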

Now it is easy to prove Theorem 1. Let $\gamma_n$ be the predictions output by the
WAA. Consider any continuous stationary prediction strategy $D$. Since every
continuous function on a metric compact is uniformly continuous ([11], Theorem
4.3.32), for any $\epsilon > 0$ we can find $\delta > 0$ such that $|\lambda(\gamma_1, y) - \lambda(\gamma_2, y)| < \epsilon$
whenever $\rho(\gamma_1, \gamma_2) < \delta$. We can further find $K$ such that $\rho(D_K, D) < \delta$, and
(11) then gives, for all biinfinite $\ldots, x_{-1}, y_{-1}, x_0, y_0, x_1, y_1, \ldots$,

$$\limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(D(\sigma_n), y_n) \right)$$

$$\le \limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(D_K(\sigma_n), y_n) \right) + \epsilon$$

$$\le \limsup_{N\to\infty} \left( L^2 e^L + \ln\frac{1}{q_K} \right) \frac{1}{\sqrt{N}} + \epsilon = \epsilon;$$

since $\epsilon$ can be arbitrarily small the WAA is CS universal.

4 Proof of Theorem 2

Let us first recall some useful facts about the probability measures on a metric
compact $\Omega$ (we will be following [32]). The Banach space of all continuous
real-valued functions on $\Omega$ with the usual pointwise addition and scalar action and
the sup norm will be denoted $C(\Omega)$. By one of the Riesz representation theorems
([10], 7.4.1; see also 7.1.1), the mapping $\mu \mapsto I_\mu$, where $I_\mu(f) := \int_\Omega f \,\mathrm{d}\mu$, is a
linear isometry between the set of all finite Borel signed measures $\mu$ on $\Omega$ with
the total variation norm and the dual space $C'(\Omega)$ to $C(\Omega)$ with the standard
dual norm ([22], Chapter 4). We will identify the finite Borel signed measures $\mu$
on $\Omega$ with the corresponding $I_\mu \in C'(\Omega)$. This makes the set $\mathcal{P}(\Omega)$ of probability
measures on $\Omega$ a convex closed subset of $C'(\Omega)$.

We will be interested, however, in a different topology on $C'(\Omega)$, the weakest
topology for which all evaluation functionals $\mu \in C'(\Omega) \mapsto \mu(f)$, $f \in C(\Omega)$, are
continuous. This topology is known as the weak* topology ([27], 3.14), and
the topology inherited by $\mathcal{P}(\Omega)$ is known as the topology of weak convergence
([3], Appendix III). The point mass $\delta_\omega$, $\omega \in \Omega$, is defined to be the probability
measure concentrated at $\omega$, $\delta_\omega(\{\omega\}) = 1$. The simple example of a sequence
of point masses $\delta_{\omega_n}$ such that $\omega_n \to \omega$ as $n \to \infty$ and $\omega_n \ne \omega$ for all $n$ shows
that the topology of weak convergence is different from the dual norm topology:
$\delta_{\omega_n} \to \delta_\omega$ holds in one but does not hold in the other.
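
The point-mass example can be made concrete numerically. The sketch below assumes $\Omega = [0, 1]$, $\omega_n = 1/n$, and $\omega = 0$; the helper names are hypothetical. Integrals of a continuous test function against $\delta_{\omega_n}$ converge to the integral against $\delta_\omega$, while the total variation norm of $\delta_{\omega_n} - \delta_\omega$ stays equal to 2.

```python
import math
from typing import Callable


def integrate_point_mass(f: Callable[[float], float], omega: float) -> float:
    """Integral of f against the point mass at omega, i.e. f(omega)."""
    return f(omega)


def tv_norm_of_difference(a: float, b: float) -> float:
    """Total variation norm of the signed measure delta_a - delta_b."""
    return 0.0 if a == b else 2.0


def weak_gap(f: Callable[[float], float], n: int) -> float:
    """|integral of f d(delta_{1/n}) - integral of f d(delta_0)|."""
    return abs(integrate_point_mass(f, 1.0 / n) - integrate_point_mass(f, 0.0))
```

With `f = math.cos`, the value `weak_gap(f, n)` shrinks as `n` grows, whereas `tv_norm_of_difference(1.0 / n, 0.0)` does not depend on `n`.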

It is not difficult to check that $\mathcal{P}(\Omega)$ remains a closed subset of $C'(\Omega)$ in the
weak* topology ([S], III.2.7, Proposition 7). By the Banach–Alaoglu theorem
([27], 3.15) $\mathcal{P}(\Omega)$ is compact in the topology of weak convergence (this is a
special case of Prokhorov's theorem, [3], Appendix III, Theorem 6). In the rest
of this paper, $\mathcal{P}(\Omega)$ (and all other spaces of probability measures) are always
equipped with the topology of weak convergence.

Since $\Omega$ is a metric compact, $\mathcal{P}(\Omega)$ is also metrizable (by the well-known
Prokhorov metric: [3], Appendix III, Theorem 6).



Define

$$\lambda(\gamma, y) := \int_\Gamma \lambda(g, y)\, \gamma(\mathrm{d}g), \tag{12}$$

where $\gamma$ is a probability measure on $\Gamma$. This is the loss function in a new game
of prediction with the prediction space $\mathcal{P}(\Gamma)$; it is convex in $\gamma$.

Let us check that the loss function (12) is continuous. If $\gamma_n \to \gamma$ and $y_n \to y$
for some $(\gamma, y) \in \mathcal{P}(\Gamma) \times Y$,

$$|\lambda(\gamma_n, y_n) - \lambda(\gamma, y)| \le |\lambda(\gamma_n, y_n) - \lambda(\gamma_n, y)| + |\lambda(\gamma_n, y) - \lambda(\gamma, y)| \to 0$$

(the first addend tends to zero because of the uniform continuity of $\lambda : \Gamma \times Y \to \mathbb{R}$
and the second addend by the definition of the topology of weak convergence).

Unfortunately, Theorem 1 cannot be applied to the new game of prediction
directly: the theorem assumes that $\Gamma$ is a subset of a Banach space, whereas
the dual to an infinite-dimensional Banach space is never even metrizable in the
weak* topology ([27], 3.16). The proof of Theorem 1, however, still works for
the new game.

It is clear that the mixture (7) is a probability measure. The result of
the previous section is still true, and the randomized prediction strategy (7)
produces $\gamma_n \in \mathcal{P}(\Gamma)$ that are guaranteed to satisfy

$$\limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(\gamma_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(D(\sigma_n), y_n) \right) \le 0, \tag{13}$$

for any continuous stationary randomized prediction strategy $D$. The loss
function is bounded in absolute value by a constant $L$, and so the law of the iterated
logarithm (see, e.g., [2E|, (5-8)) implies that

$$\limsup_{N\to\infty} \frac{\left| \sum_{n=1}^N \bigl( \lambda(g_n, y_n) - \lambda(\gamma_n, y_n) \bigr) \right|}{\sqrt{2 L^2 N \ln\ln N}} \le 1, \tag{14}$$

$$\limsup_{N\to\infty} \frac{\left| \sum_{n=1}^N \bigl( \lambda(d_n, y_n) - \lambda(D(\sigma_n), y_n) \bigr) \right|}{\sqrt{2 L^2 N \ln\ln N}} \le 1 \tag{15}$$

with probability one. Combining the last two inequalities with (13) gives

$$\limsup_{N\to\infty} \left( \frac{1}{N} \sum_{n=1}^N \lambda(g_n, y_n) - \frac{1}{N} \sum_{n=1}^N \lambda(d_n, y_n) \right) \le 0 \quad \text{a.s.}$$

Therefore, the WAA (applied to $D_1, D_2, \ldots$) is a universal continuous
randomized prediction strategy.



5 Proof of Theorem 3

In view of Theorem 1 we only need to get rid of the assumption of compactness
of $X$, $\Gamma$, and $Y$.

Game of removal

The proofs of Theorems 3 and 4 will be based on the following game (an abstract
version of the "doubling trick", [7]) played in a topological space $X$:

Game of removal $G(X)$

FOR $n = 1, 2, \ldots$:
  Remover announces compact $K_n \subseteq X$.
  Evader announces $p_n \notin K_n$.
END FOR.

Winner: Evader if the set $\{p_1, p_2, \ldots\}$ is precompact; Remover otherwise.

Intuitively, the goal of Evader is to avoid being removed to infinity. Without
loss of generality we will assume that Remover always announces a
non-decreasing sequence of compact sets: $K_1 \subseteq K_2 \subseteq \cdots$.
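
To see why Remover can win in a Euclidean space, consider the following sketch. It assumes $X = \mathbb{R}^2$ and lets Remover play the closed ball of radius $n$ on round $n$ (a non-decreasing sequence of compact sets whose interiors cover the plane); all function names are hypothetical. Any legal move of Evader then has norm greater than $n$, so the set of Evader's moves is unbounded and hence not precompact.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def remover_ball_radius(n: int) -> float:
    """Remover's move on round n: the closed ball of this radius around 0."""
    return float(n)


def is_legal_evader_move(p: Point, n: int) -> bool:
    """Evader must announce a point outside Remover's current compact set."""
    return math.hypot(p[0], p[1]) > remover_ball_radius(n)


def evader_greedy(n: int, margin: float = 1e-3) -> Point:
    """Evader tries to stay as close to the origin as the rules allow."""
    return (remover_ball_radius(n) + margin, 0.0)
```

Even the greedy Evader, who moves just outside each ball, is pushed arbitrarily far from the origin as the rounds go on.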
Lemma 3 (Gruenhage) Remover has a winning strategy in $G(X)$ if $X$ is a
locally compact and paracompact space.

Proof We will follow the proof of Theorem 4.1 in [12] (the easy direction). If
$X$ is locally compact and $\sigma$-compact, there exists a non-decreasing sequence
$K_1 \subseteq K_2 \subseteq \cdots$ of compact sets covering $X$, and each $K_n$ can be extended to
compact $K_n^*$ so that $\operatorname{Int} K_n^* \supseteq K_n$ ([11], Theorem 3.3.2). Remover will obviously
win $G(X)$ choosing $K_1^*, K_2^*, \ldots$ as his moves.

If $X$ is the sum of locally compact $\sigma$-compact spaces $X_s$, $s \in S$, Remover
plays, for each $s \in S$, the strategy described in the previous paragraph on the
subsequence of Evader's moves belonging to $X_s$. If Evader chooses $p_n \in X_s$
for infinitely many $X_s$, those $X_s$ will form an open cover of the closure of
$\{p_1, p_2, \ldots\}$ without a finite subcover. If the $p_n$ are chosen from only finitely many
$X_s$, there will be infinitely many $p_n$ chosen from some $X_s$, and the result of the
previous paragraph can be applied. It remains to remember that each locally
compact paracompact space can be represented as the sum of locally compact
$\sigma$-compact subsets ([11], Theorem 5.1.27). ∎

Large at infinity loss functions 

We will need the following useful property of large at infinity loss functions.

Lemma 4 Let $\lambda$ be a loss function that is large at infinity. For each compact
set $B \subseteq Y$ and each constant $M$ there exists a compact set $C \subseteq \Gamma$ such that

$$\forall \gamma \notin C,\ y \in B: \quad \lambda(\gamma, y) > M. \tag{16}$$

Proof For each point $y^* \in B$ fix a neighborhood $O_{y^*} \ni y^*$ and a compact set
$C(y^*) \subseteq \Gamma$ such that $\lambda(\Gamma \setminus C(y^*), O_{y^*}) \subseteq (M, \infty)$. Since the sets $O_{y^*}$ form an
open cover of $B$, we can find this cover's finite subcover $\{O_{y_1^*}, \ldots, O_{y_n^*}\}$. It is
clear that

$$C := \bigcup_{j=1}^n C(y_j^*)$$

satisfies (16). ∎

In fact, the only property of large at infinity loss functions that we will be using
is that in the conclusion of Lemma 4. In particular, it implies the following
lemma.

Lemma 5 Under the conditions of Theorem 3, for each compact set $B \subseteq Y$
there exists a compact convex set $C = C(B) \subseteq \Gamma$ such that for each continuous
stationary prediction strategy $D : \Sigma \to \Gamma$ there exists a continuous stationary
prediction strategy $D' : \Sigma \to C$ that dominates $D$ in the sense

$$\forall \sigma \in \Sigma,\ y \in B: \quad \lambda(D'(\sigma), y) \le \lambda(D(\sigma), y). \tag{17}$$
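Proof Without loss of generality $B$ is assumed non-empty. Fix any $\gamma_0 \in \Gamma$.
Let

$$M_1 := \sup_{y \in B} \lambda(\gamma_0, y),$$

let $C_1 \subseteq \Gamma$ be a compact set such that

$$\forall \gamma \notin C_1,\ y \in B: \quad \lambda(\gamma, y) > M_1 + 1,$$

let

$$M_2 := \sup_{(\gamma, y) \in C_1 \times B} \lambda(\gamma, y),$$

and let $C_2 \subseteq \Gamma$ be a compact set such that

$$\forall \gamma \notin C_2,\ y \in B: \quad \lambda(\gamma, y) > M_2 + 1.$$

It is obvious that $M_1 \le M_2$ and $\gamma_0 \in C_1 \subseteq C_2$. We can and will assume $C_2$
convex (see [27], Theorem 3.20(c)).

Let us now check that $C_1$ lies inside the interior of $C_2$. Indeed, for any fixed
$y \in B$ and $\gamma \in C_1$, we have $\lambda(\gamma, y) \le M_2$; since $\lambda(\gamma', y) > M_2 + 1$ for all $\gamma' \notin C_2$,
some neighborhood of $\gamma$ will lie completely in $C_2$.

Let $D : \Sigma \to \Gamma$ be a continuous stationary prediction strategy. We will show
that (17) holds for some continuous stationary prediction strategy $D'$ taking
values in the compact convex set $C(B) := C_2$. Namely, we define

$$D'(\sigma) := \begin{cases} D(\sigma) & \text{if } D(\sigma) \in C_1, \\[4pt] \dfrac{\rho(D(\sigma), \Gamma \setminus C_2)}{\rho(D(\sigma), C_1) + \rho(D(\sigma), \Gamma \setminus C_2)} D(\sigma) + \dfrac{\rho(D(\sigma), C_1)}{\rho(D(\sigma), C_1) + \rho(D(\sigma), \Gamma \setminus C_2)} \gamma_0 & \text{if } D(\sigma) \in C_2 \setminus C_1, \\[4pt] \gamma_0 & \text{if } D(\sigma) \in \Gamma \setminus C_2, \end{cases}$$

where $\rho$ is the metric on $\Gamma$; the denominator $\rho(D(\sigma), C_1) + \rho(D(\sigma), \Gamma \setminus C_2)$ is
positive since already $\rho(D(\sigma), C_1)$ is positive. Since $C_2$ is convex, we can see
that $D'$ indeed takes values in $C_2$. The only points $\sigma$ at which the continuity of
$D'$ is not obvious are those for which $D(\sigma)$ lies on the boundary of $C_1$: in this
case one has to use the fact that $C_1$ is covered by the interior of $C_2$.

It remains to check (17); the only non-trivial case is $D(\sigma) \in C_2 \setminus C_1$. By the
convexity of $\lambda(\gamma, y)$ in $\gamma$, the inequality in (17) will follow from

$$\frac{\rho(D(\sigma), \Gamma \setminus C_2)}{\rho(D(\sigma), C_1) + \rho(D(\sigma), \Gamma \setminus C_2)} \lambda(D(\sigma), y) + \frac{\rho(D(\sigma), C_1)}{\rho(D(\sigma), C_1) + \rho(D(\sigma), \Gamma \setminus C_2)} \lambda(\gamma_0, y) \le \lambda(D(\sigma), y),$$

i.e.,

$$\lambda(\gamma_0, y) \le \lambda(D(\sigma), y).$$

Since the left-hand side of the last inequality is at most $M_1$ and its right-hand
side exceeds $M_1 + 1$, it holds true. ∎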
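
The construction of $D'$ can be illustrated in one dimension. The sketch below is an example under explicit assumptions, none of which come from the paper: $\Gamma = \mathbb{R}$, $B = [-1, 1]$, $\lambda(\gamma, y) = (\gamma - y)^2$, and $\gamma_0 = 0$. Then $M_1 = 1$, one can take $C_1 = [-c_1, c_1]$ with $c_1 = 1 + \sqrt{2}$, $M_2 = (c_1 + 1)^2$, and $C_2 = [-c_2, c_2]$ with $c_2 = 1 + \sqrt{M_2 + 1}$. The function `dominate` maps a value $D(\sigma)$ to $D'(\sigma)$.

```python
import math

# Constants for the illustrative setting described above (all assumptions).
GAMMA0 = 0.0
C1 = 1.0 + math.sqrt(2.0)            # C_1 = [-C1, C1]
M2 = (C1 + 1.0) ** 2                 # sup of the loss over C_1 x B
C2 = 1.0 + math.sqrt(M2 + 1.0)       # C_2 = [-C2, C2]


def dist_to_c1(g: float) -> float:
    """Distance from g to the interval C_1."""
    return max(abs(g) - C1, 0.0)


def dist_to_outside_c2(g: float) -> float:
    """Distance from g to the complement of C_2."""
    return max(C2 - abs(g), 0.0)


def dominate(g: float) -> float:
    """Map a prediction g = D(sigma) to D'(sigma), which lies in C_2."""
    if abs(g) <= C1:
        return g
    if abs(g) >= C2:
        return GAMMA0
    a, b = dist_to_outside_c2(g), dist_to_c1(g)
    # Convex combination of g and gamma_0, continuous across both boundaries.
    return (a * g + b * GAMMA0) / (a + b)
```

Predictions inside $C_1$ are left alone, predictions outside $C_2$ are replaced by $\gamma_0$, and the transition zone interpolates between the two.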
Remark If the loss function is allowed to depend on the infinite past, the $\sigma$s
in Lemma 5 will have to be restricted to a compact set $A \subseteq \Sigma$, and the compact
set $C$ will depend not only on $B$ but also on $A$ (see Lemma 18 of



The proof 

For each compact $B \subseteq Y$ fix a compact convex $C(B) \subseteq \Gamma$ as in Lemma 5.
Predictor's strategy ensuring (5) is constructed from Remover's winning strategy
in $G(X \times Y)$ (see Lemma 3; metric spaces are paracompact by the Stone
theorem, [11], Theorem 5.1.3) and from Predictor's strategies $S(A, B)$ outputting
predictions

$$\gamma_n \in C(B) \tag{18}$$

and ensuring the consequent of (5) for all continuous

$$D : (A \times B)^\infty \times A \to C(B) \tag{19}$$

under the assumption that $(x_n, y_n) \in A \times B$ for given compact $A \subseteq X$ and
$B \subseteq Y$ (the existence of such $S(A, B)$ is asserted in Theorem 1). Remover's
moves are assumed to be of the form $A \times B$ for compact $A \subseteq X$ and $B \subseteq Y$.
Predictor is simultaneously playing the game of removal $G(X \times Y)$ as Evader.

At the beginning of the game of prediction Predictor asks Remover to make
his first move $A_1 \times B_1$ in the game of removal; without loss of generality we
assume that $A_1 \times B_1$ contains all $(x_n, y_n)$, $n \le 0$ (there is nothing to prove if
$\{(x_n, y_n) \mid n \le 0\}$ is not precompact). Predictor then plays the game of
prediction using the strategy $S(A_1, B_1)$ until Reality chooses $(x_n, y_n) \notin A_1 \times B_1$
(forever if Reality never chooses such $(x_n, y_n)$). As soon as such $(x_n, y_n)$ is
chosen, Predictor announces $(x_n, y_n)$ in the game of removal and notes Remover's
response $(A_2, B_2)$. He then continues playing the game of prediction using the
strategy $S(A_2, B_2)$ until Reality chooses $(x_n, y_n) \notin A_2 \times B_2$, etc.

Let us check that this strategy for Predictor will always ensure (5). If
Reality chooses $(x_n, y_n)$ outside Predictor's current $A_k \times B_k$ finitely often, the
consequent of (5) will be satisfied for all continuous stationary $D : \Sigma \to C(B_k)$
($B_k$ being the second component of Remover's last move $(A_k, B_k)$) and so,
by Lemma 5, for all continuous stationary $D : \Sigma \to \Gamma$. If Reality chooses
$(x_n, y_n)$ outside Predictor's current $A_k \times B_k$ infinitely often, the set of $(x_n, y_n)$,
$n = 1, 2, \ldots$, will not be precompact, and so the antecedent of (5) will be
violated.

6 Proof of Theorem 4

When $\gamma$ ranges over $\mathcal{P}(C)$ (identified with the subset of $\mathcal{P}(\Gamma)$ consisting of the
measures concentrated on $C$) for a compact $C \subseteq \Gamma$, the loss function (12), as
we have seen, is continuous. The following analogue of Lemma 5 will be useful.

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Lemma 6 Under the conditions of Theorem^ for each compact set B C Y 
there exists a compact convex set C = C(B) C T such that for each continuous 
stationary randomized prediction strategy D : £ — > V(T) there exists a contin- 
uous stationary randomized prediction strategy D' : £ — > V(C) such that \lT\j 
holds (D' dominates D "on average"). 

(In fact, this lemma is not needed for the proof of Theorem0]as we stated it, but 
it will imply that j n dominate D(o~ n ) on average, for any continuous stationary 
randomized prediction strategy D: see Q2U[1.) 

Proof Define $\gamma_0$, $M_1$, $C_1$, $M_2$, and $C_2$ as in the proof of Lemma 5.
Fix a continuous function $f_1 : \Gamma \to [0, 1]$ such that $f_1 = 1$ on $C_1$ and
$f_1 = 0$ on $\Gamma \setminus C_2$ (such an $f_1$ exists by the Tietze-Uryson
theorem, Theorem 2.1.8). Set $f_2 := 1 - f_1$. Let
$D : \Sigma \to \mathcal{P}(\Gamma)$ be a continuous stationary randomized
prediction strategy. For each $\sigma \in \Sigma$, split $D(\sigma)$ into two
measures on $\Gamma$ absolutely continuous with respect to $D(\sigma)$:
$D_1(\sigma)$ with Radon-Nikodym density $f_1$ and $D_2(\sigma)$ with Radon-Nikodym
density $f_2$; set

$$D'(\sigma) := D_1(\sigma) + |D_2(\sigma)| \, \delta_{\gamma_0}$$

(letting $|P| := P(\Gamma)$ for $P$ a measure on $\Gamma$). It is clear that the
stationary randomized prediction strategy $D'$ is continuous (in the topology of
weak convergence, as usual), takes values in $\mathcal{P}(C_2)$, and

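$$
\begin{aligned}
\lambda(D'(\sigma), y)
&= \int \lambda(\gamma, y) f_1(\gamma) \, D(\sigma)(d\gamma)
 + \lambda(\gamma_0, y) \int f_2(\gamma) \, D(\sigma)(d\gamma) \\
&\le \int \lambda(\gamma, y) f_1(\gamma) \, D(\sigma)(d\gamma)
 + \int M_1 f_2(\gamma) \, D(\sigma)(d\gamma) \\
&\le \int \lambda(\gamma, y) f_1(\gamma) \, D(\sigma)(d\gamma)
 + \int \lambda(\gamma, y) f_2(\gamma) \, D(\sigma)(d\gamma)
 = \lambda(D(\sigma), y)
\end{aligned}
$$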

for all $(\sigma, y) \in \Sigma \times B$. So we can take $C(B) := C_2$. ∎
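
The construction in this proof can be checked numerically on a toy case. The following is a minimal sketch under assumptions that are not in the paper: $\Gamma$ is the real line, $B = [-1, 1]$, the loss is the square loss $(\gamma - y)^2$, $\gamma_0 = 0$ (so that $M_1 = 1$ bounds $\lambda(\gamma_0, y)$ on $B$), $C_1 = [-2, 2]$, $C_2 = [-3, 3]$, and $D(\sigma)$ is a finitely supported measure stored as a dictionary. The names `f1`, `dominate` and `expected_loss` are hypothetical.

```python
from typing import Dict

Measure = Dict[float, float]  # finitely supported measure: prediction -> mass


def f1(gamma: float) -> float:
    """Continuous cut-off: equal to 1 on C_1 = [-2, 2] and to 0 outside C_2 = [-3, 3]."""
    return min(1.0, max(0.0, 3.0 - abs(gamma)))


def dominate(D: Measure, gamma0: float = 0.0) -> Measure:
    """Return D' = D_1 + |D_2| * delta_{gamma0}, where D_i has density f_i w.r.t. D."""
    D_prime: Measure = {}
    moved = 0.0
    for gamma, mass in D.items():
        kept = mass * f1(gamma)      # contribution of D_1
        if kept > 0.0:
            D_prime[gamma] = kept
        moved += mass - kept         # mass * f2(gamma), the contribution to |D_2|
    D_prime[gamma0] = D_prime.get(gamma0, 0.0) + moved
    return D_prime


def expected_loss(D: Measure, y: float) -> float:
    """Square loss of the randomized prediction D on the observation y."""
    return sum(mass * (gamma - y) ** 2 for gamma, mass in D.items())


D = {0.5: 0.5, 2.5: 0.3, 10.0: 0.2}
D_prime = dominate(D)
for y in (-1.0, 0.0, 1.0):
    print(y, expected_loss(D_prime, y), expected_loss(D, y))
```

In this toy case the mass removed from $D$ sits where $|\gamma| > 2$, and there the square loss exceeds 1 for every $y \in [-1, 1]$, so moving it to $\gamma_0$ can only reduce the expected loss, mirroring the chain of inequalities above.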

Fix one of the mappings $B \mapsto C(B)$ whose existence is asserted by the lemma.

We will prove that the prediction strategy of the previous section with (18)
replaced by $\gamma_n \in \mathcal{P}(C(B))$ and (19) replaced by

$$D : (A \times B)^\infty \times A \to \mathcal{P}(C(B))$$

is CS universal. Let $D : \Sigma \to \mathcal{P}(\Gamma)$ be a continuous stationary
randomized prediction strategy, i.e., a continuous stationary prediction strategy in
the new game of prediction with loss function (12). Let $(A_k, B_k)$ be Remover's
last move (if Remover makes infinitely many moves, the antecedent of the implication
defining CS universality is false, and there is nothing to prove), and let
$D' : \Sigma \to \mathcal{P}(C(B_k))$ be a continuous stationary randomized
prediction strategy satisfying (17) with $B := B_k$. From some $n$ on our randomized
prediction algorithm produces $\gamma_n \in \mathcal{P}(\Gamma)$ concentrated on
$C(B_k)$, and they will satisfy
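$$
\begin{aligned}
&\limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n)
 - \frac{1}{N} \sum_{n=1}^{N} \lambda(D(\sigma_n), y_n) \right) \\
&\qquad \le \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n)
 - \frac{1}{N} \sum_{n=1}^{N} \lambda(D'(\sigma_n), y_n) \right) \le 0.
\end{aligned}
\tag{20}
$$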

This is an interesting property but slightly different from what Theorem 4
asserts.

According to the proof of Lemma 6 we can, and we will, assume that $D'(\sigma_n)$
generates outcomes $d'_n$ in two steps: first $d_n$ is generated from $D(\sigma_n)$,
and then it is replaced by $\gamma_0$ with probability $f_2(d_n)$. The loss function
is bounded in absolute value on the compact set $C(B_k) \times B_k$ by a constant
$L$. From the law of the iterated logarithm (see (14) and (15)) applied to the
losses of $\gamma_n$ and $d'_n$ we now obtain, instead of (20),

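$$
\begin{aligned}
&\limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(g_n, y_n)
 - \frac{1}{N} \sum_{n=1}^{N} \lambda(d_n, y_n) \right) \\
&\qquad \le \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(g_n, y_n)
 - \frac{1}{N} \sum_{n=1}^{N} \lambda(d'_n, y_n) \right) \\
&\qquad = \limsup_{N \to \infty} \left( \frac{1}{N} \sum_{n=1}^{N} \lambda(\gamma_n, y_n)
 - \frac{1}{N} \sum_{n=1}^{N} \lambda(D'(\sigma_n), y_n) \right) \le 0 \quad \text{a.s.};
\end{aligned}
$$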
it remains to compare this with the definition of a CS universal randomized
prediction algorithm.
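
The two-step generation of $d'_n$ used in this argument can be sketched as follows. The setting is a toy one and is an assumption of the illustration, not of the paper: predictions are real numbers, $\gamma_0 = 0$, $f_2$ vanishes on $[-2, 2]$ and equals 1 outside $[-3, 3]$, the loss is the square loss, and $D(\sigma_n)$ is one fixed finitely supported distribution. The names `f2` and `sample_pair` are hypothetical.

```python
import random
from typing import Dict, Tuple


def f2(gamma: float) -> float:
    """f2 = 1 - f1, where f1 is 1 on [-2, 2] and 0 outside [-3, 3]."""
    return 1.0 - min(1.0, max(0.0, 3.0 - abs(gamma)))


def sample_pair(D: Dict[float, float], rng: random.Random,
                gamma0: float = 0.0) -> Tuple[float, float]:
    """Generate (d, d'): d is drawn from D, then replaced by gamma0 with probability f2(d)."""
    support = list(D)
    d = rng.choices(support, weights=[D[g] for g in support])[0]
    d_prime = gamma0 if rng.random() < f2(d) else d
    return d, d_prime


rng = random.Random(0)
D = {0.5: 0.5, 2.5: 0.3, 10.0: 0.2}
pairs = [sample_pair(D, rng) for _ in range(50_000)]

# Empirical mass that d' puts on gamma0; by construction it should be close to
# the total mass of D_2, which is 0.3 * 0.5 + 0.2 * 1.0 = 0.35 here.
freq_gamma0 = sum(1 for _, dp in pairs if dp == 0.0) / len(pairs)
print(freq_gamma0)
```

Because $d'_n$ differs from $d_n$ only where $f_2(d_n) > 0$, the loss of $d'_n$ never exceeds that of $d_n$ on observations in $[-1, 1]$ in this toy setting, which is the pointwise form of the first inequality in the display above.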



7 Stationarity and continuity 

As we said earlier, the assumption of stationarity is very natural for prediction 
strategies: it just means that the arbitrary origin of time is not taken into 
account (in the spirit of the invariance principle in statistics; see, e.g., [21],
Section 6.1). Stationary strategies can detect and make use of all kinds of trends
and one-off phenomena; e.g., they can perform well when the rate of environment 
change is constantly increasing (as in our own environment). There need not 
be stationarity in the environment. 
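
A toy example may make this point concrete. The sketch below assumes real-valued observations, no signals, and the square loss; the strategy `quadratic_extrapolator` is a hypothetical illustration and does not come from the paper. It looks only at the last three observations and never at the round number, so it is stationary in the sense discussed here, yet it predicts perfectly a sequence whose rate of change keeps increasing.

```python
from typing import List, Sequence


def quadratic_extrapolator(past: Sequence[float]) -> float:
    """Stationary strategy: a function of the observed past only (most recent last)."""
    y1, y2, y3 = past[-1], past[-2], past[-3]
    return 3 * y1 - 3 * y2 + y3   # exact for any quadratic sequence


def square_losses(observations: Sequence[float], warmup: int = 3) -> List[float]:
    """Losses of the strategy on observations[warmup:], using everything before as the past."""
    losses = []
    for n in range(warmup, len(observations)):
        gamma = quadratic_extrapolator(observations[:n])
        losses.append((gamma - observations[n]) ** 2)
    return losses


# An environment whose rate of change grows without bound.
accelerating = [n * n for n in range(-2, 30)]
print(sum(square_losses(accelerating)))
```

The same window of three observations always produces the same prediction, whatever the round in which it occurs; this is exactly the sense in which the arbitrary origin of time is ignored.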

Interestingly, our prediction algorithms are continuous (or can be made con- 
tinuous) but not stationary. First we discuss the continuity of the prediction 
algorithms constructed in the proofs of our four theorems. 

Theorem 1 It is easy to check that the WAA is continuous; by the Weierstrass
M-test, the series defining its predictions converges uniformly and so its sum
is continuous.

Theorem 2 To check that $\gamma_n$ is a continuous function of $\sigma_n$ in the
topology of weak convergence, we only need to check that $\int f \, d\gamma_n$ is a
continuous function of $\sigma_n$ for each $f \in C(\Gamma)$. This again follows
from the Weierstrass M-test.
Theorem 3 As described, Predictor's strategy is not continuous since his behavior
changes suddenly when Reality outputs $(x_n, y_n)$ outside his current
$A_k \times B_k$, but it is clear that it can be "smoothed around the edges" to
ensure continuity.

Theorem 4 The situation is analogous to that of Theorem 3.

For concreteness, we will discuss stationarity only in the case of Theorem 1.
We know that the WAA is a prediction strategy that is continuous as a function
of the type $\Sigma \times \{1, 2, \ldots\} \to \Gamma$. It is not stationary (i.e.,
we cannot get rid of the $\{1, 2, \ldots\}$) because it has to keep track of the
experts' losses since the beginning
of the game of prediction. Stationary strategies can depend on time only in a 
limited way: e.g., in terms of our own environment, they can depend on the 
time of day or the season. But the WAA's dependence is much heavier: it has 
to know precisely the time that has elapsed since the beginning. 
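
To make this dependence visible, here is a minimal sketch of one prediction step of a weighted-average algorithm in the style of the WAA. It rests on assumptions that are not taken from the text above: real-valued predictions, finitely many experts, and weights proportional to $q_k \exp(-c L_k / \sqrt{n})$, where $L_k$ is expert $k$'s cumulative loss before round $n$. The function name `waa_predict` and the constant `c` are hypothetical, and the exact weighting used in the paper's proof may differ.

```python
import math
from typing import Sequence


def waa_predict(n: int, cum_losses: Sequence[float],
                expert_preds: Sequence[float], prior: Sequence[float],
                c: float = 1.0) -> float:
    """Weighted average of the experts' predictions at round n.

    The weights use the round number n and the losses accumulated since the
    start of the game, so the output is not a function of the experts'
    current predictions alone.
    """
    eta = c / math.sqrt(n)                                   # assumed learning rate
    weights = [q * math.exp(-eta * L) for q, L in zip(prior, cum_losses)]
    total = sum(weights)
    return sum(w * g for w, g in zip(weights, expert_preds)) / total


preds, prior = [0.0, 1.0], [0.5, 0.5]
print(waa_predict(1, [0.0, 0.0], preds, prior))   # no losses yet: plain average
print(waa_predict(1, [0.0, 2.0], preds, prior))   # same predictions, different history
print(waa_predict(4, [0.0, 2.0], preds, prior))   # same history, different round number
```

The three calls receive identical expert predictions but return different values, because the cumulative losses and the round number enter the weights; this is the heavy dependence on elapsed time described above.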

Let us now check that there are no universal continuous stationary prediction
strategies under the conditions of Theorem 1. Suppose $\Gamma$ is such that there
exists
$f : \Gamma \to \Gamma$ without fixed points (i.e., $f(\gamma) \ne \gamma$ for all
$\gamma \in \Gamma$; we can take, e.g., a circle as $\Gamma$). If $D$ were a
universal continuous stationary strategy, we could define another continuous
stationary strategy $D'(\sigma) := f(D(\sigma))$ and make Reality collude with $D'$
(i.e., output $y_n$ leading to a significantly smaller loss for $D'$; this can be
done for an appropriate choice of $\lambda$, and in fact can be done for all usual
$\lambda$).
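
The collusion argument can be simulated for the circle. Everything specific in this sketch is an assumption chosen for illustration: points of the circle are represented by angles, there are no signals, $f$ is the rotation by $\pi/2$ (which is continuous and has no fixed points), the loss is $\lambda(\gamma, y) = 1 - \cos(\gamma - y)$, and $D$ is the stationary strategy that repeats the last observation. The names `rotate`, `loss` and `collude` are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def rotate(gamma: float, angle: float = math.pi / 2) -> float:
    """A fixed-point-free continuous map of the circle: rotation by `angle`."""
    return (gamma + angle) % (2 * math.pi)


def loss(gamma: float, y: float) -> float:
    """Continuous loss on the circle, zero exactly when gamma and y coincide."""
    return 1.0 - math.cos(gamma - y)


def collude(D: Callable[[List[float]], float], rounds: int,
            y0: float = 0.0) -> Tuple[float, float]:
    """Average losses of D and of D' = f(D) when Reality always outputs D's rotated prediction."""
    past = [y0]
    loss_D = loss_D_prime = 0.0
    for _ in range(rounds):
        gamma = D(past)                  # prediction of D
        gamma_prime = rotate(gamma)      # prediction of D'
        y = gamma_prime                  # Reality colludes with D'
        loss_D += loss(gamma, y)
        loss_D_prime += loss(gamma_prime, y)
        past.append(y)
    return loss_D / rounds, loss_D_prime / rounds


avg_D, avg_D_prime = collude(lambda past: past[-1], rounds=1000)
print(avg_D, avg_D_prime)
```

In this run $D'$ suffers no loss at all while $D$ loses about 1 in every round, so the average loss of $D$ stays bounded away from that of $D'$ and $D$ cannot be universal. The same simulation works for any other choice of the stationary strategy passed to `collude`.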

Stationary Reality 

A standard problem in probability theory is where Reality is governed by a 
stationary probability measure; of course, only stationary prediction strategies 
are considered. In this subsection we will list several references for this
problem, considering, for simplicity, only the case where the signals are absent
(formally, we assume that $X$ is a one-element set and omit the $x_n$, which now
do not carry any information, from our notation).

The problem of prediction has been studied extensively for both strictly 
stationary sequences of observations and wide sense stationary sequences (the 
definitions and a general discussion of "strict sense" and "wide sense" concepts 
can be found in 0, Chapter 2, Sections 8 and 3). We will first assume that 
$\ldots, y_{-1}, y_0, y_1, \ldots$ form a wide sense stationary sequence of random variables
and then a strictly stationary sequence. 

The natural mode of prediction for wide sense stationary sequences is linear 
prediction. The problem of linear prediction (not necessarily one-step-ahead, 
as in this paper) of wide sense stationary sequences was posed and solved by 
Kolmogorov ^[EDEHi later but independently this was done by Wiener |3*3]. 

Kolmogorov and Wiener assumed the probability distribution of the obser- 
vations known. There are many efficient ways to estimate the spectral density 
of this probability distribution (in terms of which the optimal linear predictor 
is expressed); see, e.g., 0, Chapter 9, for a review. (An early idea of spectral 
estimation was proposed by Einstein in 1914: see [23], p. 363.)
The problem of existence of universal prediction strategies for strictly sta-
tionary and ergodic sequences of observations was posed by Cover |S] , and such 
strategies were found by Ornstein for finite Y and Algoet pp for Y a Polish 
space. Papers ^JEBES] construct such strategies using techniques very similar 
to those of this paper. 

8 Conclusion 

An interesting direction of further research is to obtain non-asymptotic versions
of our results. If the benchmark class of continuous stationary prediction
strategies is compact, loss bounds can be given in terms of $\epsilon$-entropy [20].
In general, one can give loss bounds in terms of a nested family of compact sets
whose union is dense in the set of continuous stationary prediction strategies (in
analogy with Vapnik and Chervonenkis's principle of structural risk minimization
[29]).